Fix/eval scheduler actor wedge#238
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 6e7e220. Configure here.
|
@ekirimlioglu Thanks for the detailed diagnosis and patch! I verified that both failure modes are real and your fix follows the existing init-failure handler pattern, which is the right shape. Before the RapidFire AI team can proceed with reviewing and merging this, please open a GitHub Issue first describing the bugs, following the repo's Issue conventions. Specifically:
Then please reference that Issue in this PR as the one being resolved. This will help the QA team reproduce the bugs independently and validate the fix. It will also keep the project's bug-tracking conventions consistent. Overall, thanks for the great contribution. First-time PRs of this depth are rare! |
|
Filed #239 with the repro, environment, and end-to-end validation notes. Small suggestion: CONTRIBUTING.md doesn't currently mention the file-an-issue-first step. Adding a short note under "Submitting a pull request" could help future first-time contributors. |
anay-rfai
left a comment
There was a problem hiding this comment.
@ekirimlioglu Good catch and great work on the fix.
controller.py has gone through significant changes on main since you forked, most notably #237 (api gateway) and #243 (multimodal RAG). Could you update your branch against latest upstream/main before we merge?
- Update your fork either with git merge or rebase against upstream/main — your call, both are fine. We'll squash-merge at the end.
- Heads-up: the auto-merge will likely succeed without conflict, but please visually verify that your dispatch try/except still wraps the actor.process_batch.remote(...) loop and the active_tasks[actor_id] = {...} assignment in the new code shape. The surrounding regions were restructured.
- Re-run tests against the updated branch to confirm the wedge repro still passes.
Once that's pushed, this is good to approve and a maintainer will squash-merge. Thanks!
Two changes to evals/scheduling/controller.run_multi_pipeline_inference
that close a deterministic hang we hit on a 10-config grid run.
Symptom: MainThread parks in `time.sleep(0.5)` at the "Check if all
actors busy" branch while Ray reports 0% GPU/CPU usage. py-spy stack
shows the loop spinning; the bookkeeping says actors are busy but Ray
says they are idle. The hang is permanent.
Root cause has two triggers:
(1) Exception during the batch-submission block. The init step
(`actor.initialize_for_pipeline.remote(...)`) is already wrapped
by an existing try/except that calls `scheduler.remove_pipeline`,
but the subsequent `actor.process_batch.remote(...)` loop, the
`active_tasks[actor_id] = {...}` assignment, and the three
`db.set_actor_task_*` writes are not. If any of these raise
synchronously (serialization error, transient DB hiccup, etc.)
the actor was already marked busy by `scheduler.schedule()` and
leaks busy state; the controller never calls `set_completed_task`
or `remove_pipeline` for it.
(2) Actor death after dispatch. Once batches are submitted and tracked
in `active_tasks[actor_id]["futures"]`, the only completion-reaping
path is `ray.wait(futures, timeout=0)` -- a non-blocking poll. If
the actor process dies (OOM-kill, segfault from a broken native
dep, etc.) the futures become orphaned: Ray will eventually surface
them as failed, but `ray.wait(timeout=0)` only catches what is
already ready in this Python tick. The controller falls into the
"all actors busy" branch, sleeps 0.5s with no health check, and
repeats forever.
Fixes:
- Wrap the batch-submission + bookkeeping block in try/except that
mirrors the existing init-failure handler: mark the pipeline FAILED,
drop any partial active_tasks entry, call `scheduler.remove_pipeline`
so the actor is freed, and continue.
- Replace the bare `time.sleep(0.5)` in the busy-loop with
`ray.wait(all_pending, num_returns=1, timeout=0.5)`. ray.wait
surfaces both successful and failed futures (including
RayActorError from dead actors), so the next iteration's reap path
at the "Check for completed tasks" block frees the actor via the
existing exception handler. Falls back to time.sleep when no futures
are dispatched yet (initial loop entry).
The two changes are independent and complementary: either alone fixes
a subset of triggers; together they close the wedge.
Adds tests/test_pipeline_scheduler.py with 6 unit tests for the busy/free
transitions in PipelineScheduler. They document the invariant the
controller's scheduling loop relies on: an actor is marked busy by
schedule() and MUST be freed by either set_completed_task() (success)
or remove_pipeline() (failure). If the controller dispatch path fails
without calling one of those, the actor leaks busy state and the loop
wedges.
Also adds tools/repro_scheduler_wedge.py, a Ray-free standalone
reproduction of the bug. Scenario 1 simulates the buggy controller
path (dispatch fails, no remove_pipeline call) and detects the wedge
by counting consecutive "all actors busy" returns from schedule().
Scenario 2 demonstrates recovery via remove_pipeline. Useful for
maintainers to confirm the bookkeeping contract without spinning up
the full evals stack.
Existing tests on this branch:
pytest tests/test_pipeline_scheduler.py -> 6 passed
python tools/repro_scheduler_wedge.py -> exits 0, prints
"BUG REPRODUCED + FIX VERIFIED"
Tighten the new comments in controller.py, drop narrative docstrings from test_pipeline_scheduler.py, and simplify repro_scheduler_wedge.py output. No behavior change.
Drop the standalone tools/repro_scheduler_wedge.py and add the wedge scenario as a regression test in test_pipeline_scheduler.py. The unit test encodes the same invariant deterministically and runs in CI, so the standalone script no longer pulls its weight.
…ompletion The previous busy-branch passed every in-flight batch future to ray.wait(num_returns=1, timeout=0.5). ray.wait is a snapshot of readiness, not an event consumer -- once any single batch resolved (which happens early in a multi-batch task), that already-ready ref permanently satisfied num_returns=1 and the 0.5s timeout never engaged. Combined with the reap path only draining when all of an actor's futures are ready, the all-actors-busy branch hot-spun the IC poll and scheduler.schedule() at thousands of iterations per second instead of the original two per second. Partition first via a non-blocking ray.wait and only block on the not-yet-ready set. Dead-actor detection is preserved: a RayActorError-bearing future is "ready" to ray.wait, lands in the ready partition, and gets drained by the next iteration's reap path via the existing handler at controller.py:1312.
5728903 to
1d33b88
Compare
|
Rebased. 5 commits replayed cleanly with no conflicts. Visually confirmed the dispatch try/except still wraps the process_batch.remote() loop, the active_tasks[actor_id] = {...} assignment, and the three db.set_actor_task_* writes (controller.py L1939-L1983 in the new shape). The except block still calls scheduler.remove_pipeline(pipeline_id) so the actor is freed on any synchronous failure during batch dispatch. The busy-loop ray.wait partition fix also survived at L1796-L1804. Tests pass: pytest tests/test_pipeline_scheduler.py -> 7 passed in 0.96s The regression test (test_actor_leaks_busy_when_neither_completion_nor_removal_called) still shows 20/20 busy-sentinel returns when the controller-side cleanup is skipped, confirming the wedge condition is still reachable in PipelineScheduler and the controller patch is what prevents |

Changes
a. An exception during the batch-submission/bookkeeping block left the actor marked busy by PipelineScheduler.schedule() with no path to free it. The block is now wrapped in a try/except that mirrors the existing init-failure handler — marks the pipeline FAILED, drops the partial
active_tasks entry, and calls scheduler.remove_pipeline so the actor is released.
b. Actor death after dispatch (OOM, segfault) left orphaned futures and the controller's busy-branch was a blind time.sleep(0.5). Replaced with ray.wait(all_pending, num_returns=1, timeout=0.5) so failed futures surface and the existing reap path frees the actor on the next iteration.
Changelog Content
Additions
Changes
Fixes
Testing
Screenshots (if applicable)
N/A — backend-only change.
Checklist
Testing
Screenshots (if applicable)
Add screenshots to help explain your changes.
Checklist
Performance Impact
If this PR affects performance, describe the impact and any optimizations made.
Related Issues
Fixes #(issue number)
Closes #(issue number)
Related to #(issue number)
Note
Medium Risk
Touches the multi-pipeline scheduling loop and Ray future waiting semantics; mistakes could cause new hangs or premature failures under load.
Overview
Fixes a deterministic hang in
Controller.run_multi_pipeline_inferenceby (1) wrapping batch dispatch/bookkeeping in atry/exceptthat marks the pipeline FAILED and callsscheduler.remove_pipeline()so the actor is freed, and (2) replacing the busy-looptime.sleep()with aray.wait()on not-yet-ready futures so dead-actor failures surface promptly.Adds
tests/test_pipeline_scheduler.pyto codify scheduler bookkeeping invariants (busy actors must be freed viaset_completed_taskorremove_pipeline) and to cover the regression wedge scenario.Reviewed by Cursor Bugbot for commit 1d33b88. Bugbot is set up for automated code reviews on this repo. Configure here.